1 Reminders

  1. How to load a tool?
  2. How to display help of a command or a tool?
  3. How to submit jobs?
Prerequisites
  1. module load <tool> loads a tool (module avail lists the available ones)
  2. man <command>, or the --help / -h options (RTFM!)
  3. srun (interactive) or sbatch (batch script); see the IFB documentation

2 Preparation of your working directory

  1. Go to your home directory
  2. Create a directory called M5 (i.e., Module5) and move into it
  3. Create this directory structure

You should see this when you run the tree command:

.
├── CLEANING
├── FASTQ
├── MAPPING
└── QC
Step by step correction
mkdir -p ~/M5/FASTQ  # -p: no error if existing, make parent directories as needed
mkdir -p ~/M5/CLEANING
mkdir -p ~/M5/MAPPING
mkdir -p ~/M5/QC
cd ~/M5
tree ~/M5 # list contents of directories in a tree-like format.
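
The four mkdir calls above can also be written as a loop. This is only a minimal sketch: the function name make_m5_tree is hypothetical, and the base directory is passed as an argument so that the same code can be tried somewhere other than your home directory.

```shell
# Create the four working directories of the practical under a base directory.
# make_m5_tree is a hypothetical helper name, not part of any tool.
make_m5_tree() {
    base=$1
    for d in FASTQ CLEANING MAPPING QC; do
        # -p: no error if the directory exists, parents are created as needed
        mkdir -p "$base/$d" || return 1
    done
}
```

Calling make_m5_tree "$HOME/M5" gives the same tree as the step by step correction.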

3 Get raw data

  1. Download the raw data (Illumina) associated with this article (Allué-Guardia, Nyong, Koenig, Vargas, Bono, and Eppinger, 2019). You can use wget or fasterq-dump (part of sra-tools)
  2. Compress them with gzip

You should end up with these files in the FASTQ directory:

ls -ltrh ~/M5/FASTQ/
total 236M
-rw-rw-r-- 1 orue orue 127M  6 mars  12:32 SRR8082143_2.fastq.gz
-rw-rw-r-- 1 orue orue 109M  6 mars  12:32 SRR8082143_1.fastq.gz
Step by step correction
  • In the “Data availability” section of the article, find the accession for the Illumina data: SRX4909245
  • Explore SRA and ENA

Get the data with the method of your choice, wget or fasterq-dump (from sra-tools). Run the commands from ~/M5/FASTQ so that the files land in the right directory.

  • Using wget:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_2.fastq.gz
  • Using sra-tools:
module load sra-tools
srun fasterq-dump -S -p SRR8082143 --outdir . --threads 1
  • enaBrowserTools is also available

Then compress the files (only needed after fasterq-dump; the files downloaded with wget are already gzipped):

gzip *.fastq
Go further
  • What is the best method if you have hundreds of files to download?
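
One possible approach, sketched below, is to generate the download URLs from a list of run accessions instead of typing them. The function names and the accessions.txt file are hypothetical. The directory rule is an assumption about the ENA FTP layout: it reproduces the two URLs used above for SRR8082143, but it should be checked against the ENA documentation before being relied on for other accessions.

```shell
# Print the ENA FTP URL of one mate of a paired-end run.
# Assumed layout: first 6 characters, then a zero-padded
# sub-directory built from the digits beyond the 9th character.
ena_fastq_url() {
    acc=$1    # run accession, e.g. SRR8082143
    mate=$2   # 1 or 2
    prefix=$(printf '%s' "$acc" | cut -c1-6)
    case ${#acc} in
        9)  sub="" ;;
        10) sub="00$(printf '%s' "$acc" | cut -c10)/" ;;
        11) sub="0$(printf '%s' "$acc" | cut -c10-11)/" ;;
        *)  echo "unsupported accession: $acc" >&2; return 1 ;;
    esac
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s%s/%s_%s.fastq.gz\n' \
        "$prefix" "$sub" "$acc" "$acc" "$mate"
}

# Read one accession per line on standard input, print both mate URLs.
ena_urls_from_list() {
    while read -r acc; do
        [ -n "$acc" ] || continue   # skip empty lines
        ena_fastq_url "$acc" 1
        ena_fastq_url "$acc" 2
    done
}
```

With one accession per line in accessions.txt, ena_urls_from_list < accessions.txt | wget -i - downloads every file in turn, since wget -i - reads the URLs from standard input.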

4 Quality control

  1. Launch FastQC [1] on the paired-end FASTQ files of the sample you previously downloaded and write the results to the QC directory (use 8 threads)
  2. Explore the results and interpret the graphics

You should obtain these files:

ls -ltrh ~/M5/QC
total 1,9M
-rw-rw-r-- 1 orue orue 321K  6 mars  13:23 SRR8082143_1_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_1_fastqc.html
-rw-rw-r-- 1 orue orue 333K  6 mars  13:23 SRR8082143_2_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_2_fastqc.html
Step by step correction
cd ~/M5  # the FASTQ/ and QC/ paths below are relative to ~/M5
module load fastqc
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_1.fastq.gz -o QC/ -t 8
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_2.fastq.gz -o QC/ -t 8
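
Before opening the HTML reports, a quick overview of the module statuses can help. The sketch below assumes that each FastQC zip archive contains a summary.txt file with three tab-separated columns (status, module name, file name); fastqc_tally is a hypothetical helper name.

```shell
# Count PASS / WARN / FAIL lines in a FastQC summary.txt.
# Reads the file given as argument, or standard input if none is given.
fastqc_tally() {
    awk -F '\t' '
        { n[$1]++ }
        END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }
    ' "$@"
}
```

For example, unzip -p QC/SRR8082143_1_fastqc.zip SRR8082143_1_fastqc/summary.txt | fastqc_tally prints the three counts for the first read file without extracting the archive.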

5 Reads cleaning with fastp

  1. Launch fastp on the paired-end FASTQ files of the sample you previously downloaded
  • Detect and remove the classical Illumina adapters
  • Keep only the reads with:
    • mean quality >= 20 on a sliding window of 4 bases
    • at most 40% of the bases with a quality < 15
    • length of the trimmed read >= 100
  2. Inspect the results
  • How many reads are filtered out?
  • Where does fastp store its reports? Is this configurable?
Step by step correction
module load fastp
cd ~/M5
# -l: minimum read length; -w: worker threads (-t is --trim_tail1, not a thread count)
# -j / -h: paths of the JSON and HTML reports
srun --cpus-per-task 8 fastp \
  --in1 FASTQ/SRR8082143_1.fastq.gz \
  --in2 FASTQ/SRR8082143_2.fastq.gz \
  --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz \
  --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz \
  --unpaired1 CLEANING/SRR8082143_singles.fastq.gz \
  --unpaired2 CLEANING/SRR8082143_singles.fastq.gz \
  -l 100 -w 8 \
  -j CLEANING/fastp.json -h CLEANING/fastp.html
ls -ltrh ~/M5/CLEANING/
total 245M
-rw-rw-r-- 1 orue orue 113M  6 mars  12:59 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 162K  6 mars  12:59 fastp.json
-rw-rw-r-- 1 orue orue 525K  6 mars  12:59 fastp.html
-rw-rw-r-- 1 orue orue 2,2M  6 mars  12:59 SRR8082143_singles.fastq.gz
-rw-rw-r-- 1 orue orue 130M  6 mars  12:59 SRR8082143_2.cleaned_filtered.fastq.gz
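
To answer "how many reads are filtered" without opening the HTML report, the counts can be read from the JSON report. This is a rough sketch, assuming the report contains a filtering_result block with the keys passed_filter_reads, low_quality_reads, too_many_N_reads, too_short_reads and too_long_reads; the two function names are hypothetical, and a proper JSON parser would be more robust than sed.

```shell
# Print the first integer value stored under a key of a fastp JSON report.
fastp_count() {
    # $1: key name, $2: path to the JSON report
    sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$2" | head -n 1
}

# Summarise how many reads passed and how many were removed.
fastp_filtered() {
    passed=$(fastp_count passed_filter_reads "$1")
    low=$(fastp_count low_quality_reads "$1")
    many_n=$(fastp_count too_many_N_reads "$1")
    short=$(fastp_count too_short_reads "$1")
    long=$(fastp_count too_long_reads "$1")
    # Missing keys count as 0
    removed=$(( ${low:-0} + ${many_n:-0} + ${short:-0} + ${long:-0} ))
    echo "passed=${passed:-0} removed=$removed"
}
```

Run it as fastp_filtered CLEANING/fastp.json once the fastp job has finished.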

6 MultiQC

  1. Run MultiQC to obtain a single report combining the FastQC and fastp results

You should obtain these files:

ls -ltrh ~/M5/CLEANING/
total 248M
-rw-rw-r-- 1 orue orue 113M  6 mars  12:59 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 162K  6 mars  12:59 fastp.json
-rw-rw-r-- 1 orue orue 525K  6 mars  12:59 fastp.html
-rw-rw-r-- 1 orue orue 2,2M  6 mars  12:59 SRR8082143_singles.fastq.gz
-rw-rw-r-- 1 orue orue 130M  6 mars  12:59 SRR8082143_2.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 1,2M  6 mars  13:28 multiqc_report.html
drwxrwxr-x 2 orue orue 2,0M  6 mars  13:28 multiqc_data
Step by step correction
cd ~/M5
module load multiqc
multiqc -d . -o CLEANING

7 Mapping with bwa

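  1. Map the cleaned reads to the reference genome /shared/projects/dubii2020/data/module5/seance1/CP031214.1.fasta with bwa and write the alignments to a BAM file in the MAPPING directory

You should obtain this file:
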
ls -ltrh ~/M5/MAPPING/
total 249M
-rw-rw-r-- 1 orue orue 249M  6 mars  13:01 SRR8082143.bam
Step by step correction
cd ~/M5
module load bwa samtools
# The reference must be indexed once before mapping: srun bwa index <reference.fasta>
srun --cpus-per-task=33 bwa mem -t 32 \
  /shared/projects/dubii2020/data/module5/seance1/CP031214.1.fasta \
  CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz \
  CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz |
  samtools view -hbS - > MAPPING/SRR8082143.bam
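
bwa mem expects the index files written by bwa index (extensions .amb, .ann, .bwt, .pac and .sa) next to the reference FASTA. The sketch below, with the hypothetical helper name bwa_index_missing, lists the ones that are absent so you know whether the indexing step is still needed.

```shell
# Print the path of every bwa index file missing for a reference FASTA.
# Prints nothing when the index is complete.
bwa_index_missing() {
    # $1: path to the reference FASTA
    for ext in amb ann bwt pac sa; do
        [ -e "$1.$ext" ] || echo "$1.$ext"
    done
}
```

If bwa_index_missing /shared/projects/dubii2020/data/module5/seance1/CP031214.1.fasta prints anything, copy the reference to a directory you can write to and run bwa index on that copy before mapping.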

References

1. Andrews S. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

 

A work by Migale Bioinformatics Facility

https://migale.inrae.fr

Our two affiliations to cite us:

Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France

Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France